Arabic Diacritic Recovery Using a Feature-rich biLSTM Model


Abstract

Diacritics (short vowels) are typically omitted when writing Arabic text, and readers have to reintroduce them to correctly pronounce words. There are two types of Arabic diacritics: the first are core-word diacritics (CW), which specify the lexical selection, and the second are case endings (CE), which typically appear at the end of word stems and generally specify their syntactic roles. Recovering CEs is relatively harder than recovering core-word diacritics due to inter-word dependencies, which are often distant. In this article, we use a feature-rich recurrent neural network model that uses a variety of linguistic and surface-level features to recover both core-word diacritics and case endings. Our model surpasses all previous state-of-the-art systems with a CW error rate (CWER) of 2.9% and a CE error rate (CEER) of 3.7% for Modern Standard Arabic (MSA), and a CWER of 2.2% and a CEER of 2.5% for Classical Arabic (CA). When combining diacritized word cores with case endings, the resultant word error rates are 6.0% and 4.3% for MSA and CA, respectively. This highlights the effectiveness of feature engineering for such deep neural models.
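The abstract reports three separate error rates without showing how they are computed. The following is a minimal sketch of one way such rates could be scored, not the authors' evaluation script. It assumes that the case ending is simply the set of diacritics on the last letter of each word (a simplification, since the abstract places case endings at the end of word stems, which is not always the last letter), and that a word counts as wrong overall if either its core or its ending is wrong. The names `split_word` and `error_rates` are hypothetical.

```python
# Arabic diacritic marks (tanween, short vowels, shadda, sukun): U+064B..U+0652.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_word(word):
    """Split a diacritized word into (core marks, ending marks).

    Assumption: the case ending is the diacritics on the final letter;
    everything before it is treated as the word core.
    """
    marks = []  # one string of diacritics per base letter
    for ch in word:
        if ch in DIACRITICS:
            if marks:  # ignore a stray mark with no preceding letter
                marks[-1] += ch
        else:
            marks.append("")
    # Sort marks on each letter so that mark order does not cause mismatches.
    marks = ["".join(sorted(m)) for m in marks]
    if not marks:
        return (), ""
    return tuple(marks[:-1]), marks[-1]


def error_rates(reference, hypothesis):
    """Return CWER, CEER, and combined WER over aligned word lists."""
    if len(reference) != len(hypothesis) or not reference:
        raise ValueError("word lists must be non-empty and the same length")
    cw_errors = ce_errors = word_errors = 0
    for ref, hyp in zip(reference, hypothesis):
        ref_core, ref_end = split_word(ref)
        hyp_core, hyp_end = split_word(hyp)
        cw_bad = ref_core != hyp_core
        ce_bad = ref_end != hyp_end
        cw_errors += cw_bad
        ce_errors += ce_bad
        word_errors += cw_bad or ce_bad
    n = len(reference)
    return {"CWER": cw_errors / n, "CEER": ce_errors / n, "WER": word_errors / n}
```

Under these assumptions, the combined word error rate can be lower than the sum of CWER and CEER, because a word that has both a core error and a case-ending error is counted only once.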


Similar resources

Arabic Text Representation using Rich Semantic Graph: A Case Study

Representing Arabic text semantically using a Rich Semantic Graph (RSG) is one of the recent techniques that facilitate the manipulation of the Arabic language in the Natural Language Processing (NLP) field. The work presented in this paper is part of ongoing research to create an abstractive summary for a single input document in the Arabic language. The abstractive summary is generated thr...


Off-line Arabic Handwritten Recognition Using a Novel Hybrid HMM-DNN Model

To facilitate the entry of data into computers and its digitization, automatic recognition of printed texts and manuscripts is a considerable aid to many applications. Research on automatic document recognition started decades ago with the recognition of isolated digits and letters, and today, due to advancements in machine learning methods, efforts are being made to iden...


Answer Selection in Arabic Community Question Answering: A Feature-Rich Approach

The task of answer selection in community question answering consists of identifying pertinent answers from a pool of user-generated comments related to a question. The recent SemEval-2015 introduced a shared task on community question answering, providing a corpus and evaluation scheme. In this paper we address the problem of answer selection in Arabic. Our proposed model includes a manifold o...


A Simple Circuit Model Showing Feature-Rich Bogdanov-Takens Bifurcation

A circuit model is proposed for studying the global behavior of the normal form describing the Bogdanov-Takens bifurcation, which is encountered in the study of autonomous dynamical systems arising in different branches of science and engineering. The circuit is easy to implement, and one can experimentally study the rich dynamics and bifurcations simply by altering the values of some linear cir...


A Feature-Rich Constituent Context Model for Grammar Induction

We present LLCCM, a log-linear variant of the constituent context model (CCM) of grammar induction. LLCCM retains the simplicity of the original CCM but extends robustly to long sentences. On sentences of up to length 40, LLCCM outperforms CCM by 13.9% bracketing F1 and outperforms a right-branching baseline in regimes where CCM does not.



Journal

Journal title: ACM Transactions on Asian and Low-Resource Language Information Processing

Year: 2021

ISSN: 2375-4699, 2375-4702

DOI: https://doi.org/10.1145/3434235